The COVID-19 pandemic transformed the world in unprecedented ways, making it one of the most significant events of the last decade. It deeply affected not only public health and the economy, but also daily life. This analysis focuses on the European continent, which experienced several waves of infection and put in place a number of interventions designed to fight the virus, limit its spread, and enhance public safety. The project's primary objective is therefore to examine and visualize pandemic data in order to gain a deeper understanding of the dynamics and effects of COVID-19.
The dataset employed comes from Our World in Data (https://covid.ourworldindata.org/data/owid-covid-data.csv) and includes a wide range of features, such as confirmed cases, deaths, vaccination rates, government policies, and other relevant indicators.
Some important questions this analysis seeks to answer are:
This first section is devoted to the procedure needed to clean the data and prepare them for subsequent analysis. To begin, the necessary libraries are imported: pandas and numpy for data manipulation, matplotlib and seaborn for visualization, and scikit-learn for machine learning techniques. Subsequently, the local existence of the dataset file is verified; if it does not exist, it is downloaded from the internet and loaded into a pandas DataFrame.
# Importing required libraries
import os
# Limit the number of CPU workers used by joblib/loky (silences core-detection warnings)
os.environ['LOKY_MAX_CPU_COUNT'] = '10'
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, LassoCV, Ridge, Lasso
from sklearn.metrics import r2_score, adjusted_rand_score, mean_squared_error, mean_absolute_error, silhouette_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_selection import SelectKBest, f_regression, RFE, SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.cluster import KMeans
# Define file path
path = "owid-covid-data.csv"
# Check if the dataset file exists locally, if not, download it
if not os.path.exists(path):
    import urllib.request
    url = "https://covid.ourworldindata.org/data/owid-covid-data.csv"
    urllib.request.urlretrieve(url, path)
# Load the dataset from the CSV file
data = pd.read_csv(path)
At first glance, the dataset presents more than 380 thousand observations and 67 features, covering COVID data for all countries worldwide. However, even an initial overview makes it evident that there are missing values and that further investigation is required.
print('Number of instances = %d' % (data.shape[0]))
print('Number of attributes = %d' % (data.shape[1]))
data
Number of instances = 383744
Number of attributes = 67
| iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ... | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFG | Asia | Afghanistan | 2020-01-05 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | ... | NaN | 37.746 | 0.5 | 64.83 | 0.511 | 41128772.0 | NaN | NaN | NaN | NaN |
| 1 | AFG | Asia | Afghanistan | 2020-01-06 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | ... | NaN | 37.746 | 0.5 | 64.83 | 0.511 | 41128772.0 | NaN | NaN | NaN | NaN |
| 2 | AFG | Asia | Afghanistan | 2020-01-07 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | ... | NaN | 37.746 | 0.5 | 64.83 | 0.511 | 41128772.0 | NaN | NaN | NaN | NaN |
| 3 | AFG | Asia | Afghanistan | 2020-01-08 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | ... | NaN | 37.746 | 0.5 | 64.83 | 0.511 | 41128772.0 | NaN | NaN | NaN | NaN |
| 4 | AFG | Asia | Afghanistan | 2020-01-09 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | ... | NaN | 37.746 | 0.5 | 64.83 | 0.511 | 41128772.0 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 383739 | ZWE | Africa | Zimbabwe | 2024-02-28 | 266355.0 | 0.0 | 0.714 | 5739.0 | 0.0 | 0.286 | ... | 30.7 | 36.791 | 1.7 | 61.49 | 0.571 | 16320539.0 | NaN | NaN | NaN | NaN |
| 383740 | ZWE | Africa | Zimbabwe | 2024-02-29 | 266355.0 | 0.0 | 0.714 | 5739.0 | 0.0 | 0.286 | ... | 30.7 | 36.791 | 1.7 | 61.49 | 0.571 | 16320539.0 | NaN | NaN | NaN | NaN |
| 383741 | ZWE | Africa | Zimbabwe | 2024-03-01 | 266355.0 | 0.0 | 0.714 | 5739.0 | 0.0 | 0.286 | ... | 30.7 | 36.791 | 1.7 | 61.49 | 0.571 | 16320539.0 | NaN | NaN | NaN | NaN |
| 383742 | ZWE | Africa | Zimbabwe | 2024-03-02 | 266355.0 | 0.0 | 0.714 | 5739.0 | 0.0 | 0.286 | ... | 30.7 | 36.791 | 1.7 | 61.49 | 0.571 | 16320539.0 | NaN | NaN | NaN | NaN |
| 383743 | ZWE | Africa | Zimbabwe | 2024-03-03 | 266358.0 | 3.0 | 0.429 | 5740.0 | 1.0 | 0.143 | ... | 30.7 | 36.791 | 1.7 | 61.49 | 0.571 | 16320539.0 | NaN | NaN | NaN | NaN |
383744 rows × 67 columns
This analysis focuses mainly on the European continent, which is why the data were initially filtered to retain only countries located in Europe. All subsequent data cleaning, preprocessing, and analysis are carried out on this subset.
# Filtering the dataset to focus on European countries
# Remove unnecessary columns and reset the index
europe_data = data.loc[data['continent'] == 'Europe'].reset_index()
europe_data = europe_data.drop(['index', 'continent', 'tests_units'], axis=1)
print('Number of instances = %d' % (europe_data.shape[0]))
print('Number of attributes = %d' % (europe_data.shape[1]))
europe_data
Number of instances = 82841
Number of attributes = 65
| iso_code | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | total_cases_per_million | ... | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALB | Albania | 2020-01-05 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | NaN | ... | 51.2 | NaN | 2.89 | 78.57 | 0.795 | 2842318.0 | NaN | NaN | NaN | NaN |
| 1 | ALB | Albania | 2020-01-06 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | NaN | ... | 51.2 | NaN | 2.89 | 78.57 | 0.795 | 2842318.0 | NaN | NaN | NaN | NaN |
| 2 | ALB | Albania | 2020-01-07 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | NaN | ... | 51.2 | NaN | 2.89 | 78.57 | 0.795 | 2842318.0 | NaN | NaN | NaN | NaN |
| 3 | ALB | Albania | 2020-01-08 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | NaN | ... | 51.2 | NaN | 2.89 | 78.57 | 0.795 | 2842318.0 | NaN | NaN | NaN | NaN |
| 4 | ALB | Albania | 2020-01-09 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | NaN | ... | 51.2 | NaN | 2.89 | 78.57 | 0.795 | 2842318.0 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 82836 | OWID_WLS | Wales | 2023-07-08 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 3170000.0 | NaN | NaN | NaN | NaN |
| 82837 | OWID_WLS | Wales | 2023-07-09 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 3170000.0 | NaN | NaN | NaN | NaN |
| 82838 | OWID_WLS | Wales | 2023-07-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 3170000.0 | NaN | NaN | NaN | NaN |
| 82839 | OWID_WLS | Wales | 2023-07-11 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 3170000.0 | NaN | NaN | NaN | NaN |
| 82840 | OWID_WLS | Wales | 2023-07-12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 3170000.0 | NaN | NaN | NaN | NaN |
82841 rows × 65 columns
# Displaying information about the European dataset
europe_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82841 entries, 0 to 82840
Data columns (total 65 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   iso_code                                    82841 non-null  object
 1   location                                    82841 non-null  object
 2   date                                        82841 non-null  object
 3   total_cases                                 74734 non-null  float64
 4   new_cases                                   76798 non-null  float64
 5   new_cases_smoothed                          76543 non-null  float64
 6   total_deaths                                71751 non-null  float64
 7   new_deaths                                  76798 non-null  float64
 8   new_deaths_smoothed                         76543 non-null  float64
 9   total_cases_per_million                     74734 non-null  float64
 10  new_cases_per_million                       76798 non-null  float64
 11  new_cases_smoothed_per_million              76543 non-null  float64
 12  total_deaths_per_million                    71751 non-null  float64
 13  new_deaths_per_million                      76798 non-null  float64
 14  new_deaths_smoothed_per_million             76543 non-null  float64
 15  reproduction_rate                           45775 non-null  float64
 16  icu_patients                                26624 non-null  float64
 17  icu_patients_per_million                    26624 non-null  float64
 18  hosp_patients                               31652 non-null  float64
 19  hosp_patients_per_million                   31652 non-null  float64
 20  weekly_icu_admissions                       8152 non-null   float64
 21  weekly_icu_admissions_per_million           8152 non-null   float64
 22  weekly_hosp_admissions                      18420 non-null  float64
 23  weekly_hosp_admissions_per_million          18420 non-null  float64
 24  total_tests                                 27095 non-null  float64
 25  new_tests                                   26201 non-null  float64
 26  total_tests_per_thousand                    27095 non-null  float64
 27  new_tests_per_thousand                      26201 non-null  float64
 28  new_tests_smoothed                          32310 non-null  float64
 29  new_tests_smoothed_per_thousand             32310 non-null  float64
 30  positive_rate                               30424 non-null  float64
 31  tests_per_case                              30281 non-null  float64
 32  total_vaccinations                          26665 non-null  float64
 33  people_vaccinated                           24874 non-null  float64
 34  people_fully_vaccinated                     23956 non-null  float64
 35  total_boosters                              17294 non-null  float64
 36  new_vaccinations                            23051 non-null  float64
 37  new_vaccinations_smoothed                   44561 non-null  float64
 38  total_vaccinations_per_hundred              26665 non-null  float64
 39  people_vaccinated_per_hundred               24874 non-null  float64
 40  people_fully_vaccinated_per_hundred         23956 non-null  float64
 41  total_boosters_per_hundred                  17294 non-null  float64
 42  new_vaccinations_smoothed_per_million       44561 non-null  float64
 43  new_people_vaccinated_smoothed              44235 non-null  float64
 44  new_people_vaccinated_smoothed_per_hundred  44235 non-null  float64
 45  stringency_index                            48048 non-null  float64
 46  population_density                          73047 non-null  float64
 47  median_age                                  60887 non-null  float64
 48  aged_65_older                               60887 non-null  float64
 49  aged_70_older                               59367 non-null  float64
 50  gdp_per_capita                              63927 non-null  float64
 51  extreme_poverty                             45665 non-null  float64
 52  cardiovasc_death_rate                       62407 non-null  float64
 53  diabetes_prevalence                         66967 non-null  float64
 54  female_smokers                              60887 non-null  float64
 55  male_smokers                                60887 non-null  float64
 56  handwashing_facilities                      4560 non-null   float64
 57  hospital_beds_per_thousand                  65447 non-null  float64
 58  life_expectancy                             73047 non-null  float64
 59  human_development_index                     63927 non-null  float64
 60  population                                  82841 non-null  float64
 61  excess_mortality_cumulative_absolute        7181 non-null   float64
 62  excess_mortality_cumulative                 7181 non-null   float64
 63  excess_mortality                            7181 non-null   float64
 64  excess_mortality_cumulative_per_million     7181 non-null   float64
dtypes: float64(62), object(3)
memory usage: 41.1+ MB
As can be seen from the output right above, the dataset consists mainly of numeric features, with the exception of the country name and the ISO code identifying it. In addition, there is a column containing temporal information that requires special attention. This column will be converted to the datetime format. The advantage of converting the date column to datetime is that it allows more efficient temporal management and analysis. This format will be useful for future data filtering, sorting and aggregation operations.
# Converting the 'date' column to datetime format
europe_data['date'] = pd.to_datetime(europe_data['date'])
europe_data['date'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 82841 entries, 0 to 82840
Series name: date
Non-Null Count  Dtype
--------------  -----
82841 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 647.3 KB
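With the column stored as datetime64, date-based filtering, sorting, and aggregation become straightforward. A minimal sketch on toy data (the values and column names below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame with a datetime column (illustrative values only)
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-01', '2020-06-15', '2021-01-10']),
    'new_cases': [10, 250, 4000],
})

# Boolean mask selecting the second half of 2020;
# pandas coerces the comparison strings to timestamps automatically
mask = (df['date'] >= '2020-06-01') & (df['date'] <= '2020-12-31')
second_half_2020 = df[mask]
```

The same pattern applies directly to europe_data['date'] once it has been converted.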
# Listing unique ISO codes and locations
europe_data['iso_code'].unique()
array(['ALB', 'AND', 'AUT', 'BLR', 'BEL', 'BIH', 'BGR', 'HRV', 'CYP',
'CZE', 'DNK', 'OWID_ENG', 'EST', 'FRO', 'FIN', 'FRA', 'DEU', 'GIB',
'GRC', 'GGY', 'HUN', 'ISL', 'IRL', 'IMN', 'ITA', 'JEY', 'OWID_KOS',
'LVA', 'LIE', 'LTU', 'LUX', 'MLT', 'MDA', 'MCO', 'MNE', 'NLD',
'MKD', 'OWID_NIR', 'NOR', 'POL', 'PRT', 'ROU', 'RUS', 'SMR',
'OWID_SCT', 'SRB', 'SVK', 'SVN', 'ESP', 'SWE', 'CHE', 'UKR', 'GBR',
'VAT', 'OWID_WLS'], dtype=object)
europe_data['location'].unique()
array(['Albania', 'Andorra', 'Austria', 'Belarus', 'Belgium',
'Bosnia and Herzegovina', 'Bulgaria', 'Croatia', 'Cyprus',
'Czechia', 'Denmark', 'England', 'Estonia', 'Faeroe Islands',
'Finland', 'France', 'Germany', 'Gibraltar', 'Greece', 'Guernsey',
'Hungary', 'Iceland', 'Ireland', 'Isle of Man', 'Italy', 'Jersey',
'Kosovo', 'Latvia', 'Liechtenstein', 'Lithuania', 'Luxembourg',
'Malta', 'Moldova', 'Monaco', 'Montenegro', 'Netherlands',
'North Macedonia', 'Northern Ireland', 'Norway', 'Poland',
'Portugal', 'Romania', 'Russia', 'San Marino', 'Scotland',
'Serbia', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland',
'Ukraine', 'United Kingdom', 'Vatican', 'Wales'], dtype=object)
print('Total number of countries:', europe_data['location'].nunique())
Total number of countries: 55
As part of feature selection, a preliminary exploration of the available columns was performed in order to assess the extent of the information they contain.
Subsequently, features considered crucial for the analysis were selected. These include metrics directly related to the pandemic, such as total cases and deaths per million people, cumulative excess mortality per million, and indicators of hospital pressure.
In addition, socio-economic and demographic variables were considered significant for understanding the context of the crisis response. These include GDP per capita, population density, median age, the proportion of elderly people, the extreme poverty level, and smoking habits.
Public health-related characteristics were also included, such as the availability of handwashing facilities and the number of hospital beds per thousand people, both crucial factors in the management of health emergencies.
This targeted feature selection is intended to focus the analysis on the most significant and informative factors, providing an in-depth view of the situation resulting from the COVID-19 pandemic.
# Listing column names of the European dataset
europe_data.columns.tolist()
['iso_code', 'location', 'date', 'total_cases', 'new_cases', 'new_cases_smoothed', 'total_deaths', 'new_deaths', 'new_deaths_smoothed', 'total_cases_per_million', 'new_cases_per_million', 'new_cases_smoothed_per_million', 'total_deaths_per_million', 'new_deaths_per_million', 'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients', 'icu_patients_per_million', 'hosp_patients', 'hosp_patients_per_million', 'weekly_icu_admissions', 'weekly_icu_admissions_per_million', 'weekly_hosp_admissions', 'weekly_hosp_admissions_per_million', 'total_tests', 'new_tests', 'total_tests_per_thousand', 'new_tests_per_thousand', 'new_tests_smoothed', 'new_tests_smoothed_per_thousand', 'positive_rate', 'tests_per_case', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated', 'total_boosters', 'new_vaccinations', 'new_vaccinations_smoothed', 'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred', 'people_fully_vaccinated_per_hundred', 'total_boosters_per_hundred', 'new_vaccinations_smoothed_per_million', 'new_people_vaccinated_smoothed', 'new_people_vaccinated_smoothed_per_hundred', 'stringency_index', 'population_density', 'median_age', 'aged_65_older', 'aged_70_older', 'gdp_per_capita', 'extreme_poverty', 'cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers', 'male_smokers', 'handwashing_facilities', 'hospital_beds_per_thousand', 'life_expectancy', 'human_development_index', 'population', 'excess_mortality_cumulative_absolute', 'excess_mortality_cumulative', 'excess_mortality', 'excess_mortality_cumulative_per_million']
# Selecting relevant features for analysis
europe_data = europe_data[['iso_code', 'location', 'date', 'total_cases_per_million', 'total_deaths_per_million',
'excess_mortality_cumulative_per_million', 'icu_patients_per_million', 'hosp_patients_per_million',
'weekly_icu_admissions_per_million', 'weekly_hosp_admissions_per_million', 'stringency_index',
'reproduction_rate', 'people_vaccinated', 'people_fully_vaccinated', 'population_density',
'median_age', 'aged_65_older', 'aged_70_older', 'gdp_per_capita', 'extreme_poverty',
'female_smokers', 'male_smokers', 'handwashing_facilities', 'hospital_beds_per_thousand',
'life_expectancy', 'human_development_index']]
europe_data
| iso_code | location | date | total_cases_per_million | total_deaths_per_million | excess_mortality_cumulative_per_million | icu_patients_per_million | hosp_patients_per_million | weekly_icu_admissions_per_million | weekly_hosp_admissions_per_million | ... | aged_65_older | aged_70_older | gdp_per_capita | extreme_poverty | female_smokers | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALB | Albania | 2020-01-05 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | NaN | 2.89 | 78.57 | 0.795 |
| 1 | ALB | Albania | 2020-01-06 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | NaN | 2.89 | 78.57 | 0.795 |
| 2 | ALB | Albania | 2020-01-07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | NaN | 2.89 | 78.57 | 0.795 |
| 3 | ALB | Albania | 2020-01-08 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | NaN | 2.89 | 78.57 | 0.795 |
| 4 | ALB | Albania | 2020-01-09 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | NaN | 2.89 | 78.57 | 0.795 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 82836 | OWID_WLS | Wales | 2021-11-29 | NaN | NaN | NaN | 0.000 | 0.000 | NaN | 0.631 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 82837 | OWID_WLS | Wales | 2021-11-30 | NaN | NaN | NaN | 0.000 | 0.000 | NaN | 0.631 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 82838 | OWID_WLS | Wales | 2021-12-01 | NaN | NaN | NaN | 0.000 | 0.000 | NaN | 0.631 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 82839 | OWID_WLS | Wales | 2021-12-02 | NaN | NaN | NaN | 0.000 | 0.000 | NaN | 0.631 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 82840 | OWID_WLS | Wales | 2021-12-03 | NaN | NaN | NaN | 1.577 | 44.795 | NaN | 1.577 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
82841 rows × 26 columns
The next fundamental operation is the handling of missing values. First, the number of missing values was computed for each variable, giving an overview of their distribution within the dataset. To provide a more immediate, graphical view, a heatmap was created; this makes it easy to spot columns with a high number of missing values. In addition, the percentage of missing values was calculated for each column in order to decide which columns to keep: only columns with less than 50% missing values were retained.
# Displaying the number of missing values for each variable
print('Number of missing values:')
for col in europe_data.columns:
    print('\t%s: %d' % (col, europe_data[col].isna().sum()))
Number of missing values:
	iso_code: 0
	location: 0
	date: 0
	total_cases_per_million: 8107
	total_deaths_per_million: 11090
	excess_mortality_cumulative_per_million: 75660
	icu_patients_per_million: 56217
	hosp_patients_per_million: 51189
	weekly_icu_admissions_per_million: 74689
	weekly_hosp_admissions_per_million: 64421
	stringency_index: 34793
	reproduction_rate: 37066
	people_vaccinated: 57967
	people_fully_vaccinated: 58885
	population_density: 9794
	median_age: 21954
	aged_65_older: 21954
	aged_70_older: 23474
	gdp_per_capita: 18914
	extreme_poverty: 37176
	female_smokers: 21954
	male_smokers: 21954
	handwashing_facilities: 78281
	hospital_beds_per_thousand: 17394
	life_expectancy: 9794
	human_development_index: 18914
# Creating a heatmap to visualize missing values
plt.figure(figsize=(20, 10))
sns.heatmap(europe_data.isna(), cbar=False)
plt.title("Missing Heatmap", fontsize=15)
plt.show()
# Calculating the percentage of missing values for each column
missing_values_percentage = (europe_data.isna().sum() / len(europe_data)) * 100
missing_values_percentage.round(2)
iso_code                                    0.00
location                                    0.00
date                                        0.00
total_cases_per_million                     9.79
total_deaths_per_million                   13.39
excess_mortality_cumulative_per_million    91.33
icu_patients_per_million                   67.86
hosp_patients_per_million                  61.79
weekly_icu_admissions_per_million          90.16
weekly_hosp_admissions_per_million         77.76
stringency_index                           42.00
reproduction_rate                          44.74
people_vaccinated                          69.97
people_fully_vaccinated                    71.08
population_density                         11.82
median_age                                 26.50
aged_65_older                              26.50
aged_70_older                              28.34
gdp_per_capita                             22.83
extreme_poverty                            44.88
female_smokers                             26.50
male_smokers                               26.50
handwashing_facilities                     94.50
hospital_beds_per_thousand                 21.00
life_expectancy                            11.82
human_development_index                    22.83
dtype: float64
# Selecting columns with less than 50% missing values
columns_to_keep = missing_values_percentage[missing_values_percentage < 50].index
# Update the dataset with selected columns
europe_data = europe_data[columns_to_keep]
europe_data
| iso_code | location | date | total_cases_per_million | total_deaths_per_million | stringency_index | reproduction_rate | population_density | median_age | aged_65_older | aged_70_older | gdp_per_capita | extreme_poverty | female_smokers | male_smokers | hospital_beds_per_thousand | life_expectancy | human_development_index | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALB | Albania | 2020-01-05 | NaN | NaN | 0.0 | NaN | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 1 | ALB | Albania | 2020-01-06 | NaN | NaN | 0.0 | NaN | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 2 | ALB | Albania | 2020-01-07 | NaN | NaN | 0.0 | NaN | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 3 | ALB | Albania | 2020-01-08 | NaN | NaN | 0.0 | NaN | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 4 | ALB | Albania | 2020-01-09 | NaN | NaN | 0.0 | NaN | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 82836 | OWID_WLS | Wales | 2021-11-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 82837 | OWID_WLS | Wales | 2021-11-30 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 82838 | OWID_WLS | Wales | 2021-12-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 82839 | OWID_WLS | Wales | 2021-12-02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 82840 | OWID_WLS | Wales | 2021-12-03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
82841 rows × 18 columns
Given the still substantial presence of missing values, additional cleaning was performed by removing countries with more than 10,000 missing values across their variables. This made it possible to focus on a more relevant subset of European countries with enough data to represent them. Of the initial 55 countries, 40 remained after these cleaning operations.
# Visualizing missing values in the filtered dataset
plt.figure(figsize=(20, 10))
sns.heatmap(europe_data.isna(), cbar=False)
plt.title("Missing Heatmap", fontsize=15)
plt.show()
# Counting missing values for each country
missing_values_count_list = []
for iso_code in europe_data['iso_code'].unique():
    country_df = europe_data[europe_data['iso_code'] == iso_code]
    missing_values = country_df.drop(['iso_code', 'location'], axis=1).isnull().sum().sum()
    missing_values_country_df = pd.DataFrame({'iso_code': [iso_code], 'missing_values': [missing_values]})
    missing_values_count_list.append(missing_values_country_df)
missing_values_count = pd.concat(missing_values_count_list, ignore_index=True)
print(missing_values_count)
    iso_code  missing_values
0        ALB            1073
1        AND           10192
2        AUT            1046
3        BLR            2607
4        BEL            1021
5        BIH            1072
6        BGR            1111
7        HRV            1062
8        CYP            2608
9        CZE            2650
10       DNK            1046
11  OWID_ENG           20385
12       EST            1064
13       FRO           16062
14       FIN            2532
15       FRA            2521
16       DEU            2431
17       GIB           17098
18       GRC            1093
19       GGY           19914
20       HUN            1071
21       ISL            1034
22       IRL            1050
23       IMN           16888
24       ITA            1021
25       JEY           19914
26  OWID_KOS           13248
27       LVA            1084
28       LIE           11878
29       LTU            1140
30       LUX            1053
31       MLT            1130
32       MDA            1073
33       MCO           13288
34       MNE            2271
35       NLD            2604
36       MKD            5197
37  OWID_NIR           20580
38       NOR            1035
39       POL            2577
40       PRT            1049
41       ROU            1064
42       RUS            1039
43       SMR           11686
44  OWID_SCT           19575
45       SRB            4109
46       SVK            1083
47       SVN            2576
48       ESP             968
49       SWE            1069
50       CHE            2554
51       UKR            1075
52       GBR             968
53       VAT           19823
54  OWID_WLS           17970
# Filtering out countries with more than 10000 missing values
countries_to_drop = missing_values_count[missing_values_count['missing_values'] > 10000]['iso_code']
europe_data = europe_data[~europe_data['iso_code'].isin(countries_to_drop)]
europe_data.reset_index(drop=True, inplace=True)
europe_data
| iso_code | location | date | total_cases_per_million | total_deaths_per_million | stringency_index | reproduction_rate | population_density | median_age | aged_65_older | aged_70_older | gdp_per_capita | extreme_poverty | female_smokers | male_smokers | hospital_beds_per_thousand | life_expectancy | human_development_index | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALB | Albania | 2020-01-05 | NaN | NaN | 0.0 | NaN | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 1 | ALB | Albania | 2020-01-06 | NaN | NaN | 0.0 | NaN | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 2 | ALB | Albania | 2020-01-07 | NaN | NaN | 0.0 | NaN | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 3 | ALB | Albania | 2020-01-08 | NaN | NaN | 0.0 | NaN | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 4 | ALB | Albania | 2020-01-09 | NaN | NaN | 0.0 | NaN | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 60882 | GBR | United Kingdom | 2022-10-22 | 368948.549 | 3438.241 | NaN | NaN | 272.898 | 40.8 | 18.517 | 12.527 | 39753.244 | 0.2 | 20.0 | 24.7 | 2.54 | 81.32 | 0.932 |
| 60883 | GBR | United Kingdom | 2022-10-23 | 368948.549 | 3438.241 | NaN | NaN | 272.898 | 40.8 | 18.517 | 12.527 | 39753.244 | 0.2 | 20.0 | 24.7 | 2.54 | 81.32 | 0.932 |
| 60884 | GBR | United Kingdom | 2022-10-24 | 368948.549 | 3438.241 | NaN | NaN | 272.898 | 40.8 | 18.517 | 12.527 | 39753.244 | 0.2 | 20.0 | 24.7 | 2.54 | 81.32 | 0.932 |
| 60885 | GBR | United Kingdom | 2022-10-25 | 368948.549 | 3438.241 | NaN | NaN | 272.898 | 40.8 | 18.517 | 12.527 | 39753.244 | 0.2 | 20.0 | 24.7 | 2.54 | 81.32 | 0.932 |
| 60886 | GBR | United Kingdom | 2022-10-26 | 368969.317 | 3438.241 | NaN | NaN | 272.898 | 40.8 | 18.517 | 12.527 | 39753.244 | 0.2 | 20.0 | 24.7 | 2.54 | 81.32 | 0.932 |
60887 rows × 18 columns
print(europe_data['location'].unique())
print('Total number of countries:', europe_data['location'].nunique())
['Albania' 'Austria' 'Belarus' 'Belgium' 'Bosnia and Herzegovina'
 'Bulgaria' 'Croatia' 'Cyprus' 'Czechia' 'Denmark' 'Estonia' 'Finland'
 'France' 'Germany' 'Greece' 'Hungary' 'Iceland' 'Ireland' 'Italy'
 'Latvia' 'Lithuania' 'Luxembourg' 'Malta' 'Moldova' 'Montenegro'
 'Netherlands' 'North Macedonia' 'Norway' 'Poland' 'Portugal' 'Romania'
 'Russia' 'Serbia' 'Slovakia' 'Slovenia' 'Spain' 'Sweden' 'Switzerland'
 'Ukraine' 'United Kingdom']
Total number of countries: 40
The interpolate function fills missing values with estimates based on the surrounding data points. By applying it separately for each country (location), interpolation is performed independently on each country's series. This approach makes sense because different countries may have distinct trends and patterns in their data, so interpolating per country preserves these unique characteristics. It is also particularly well suited to temporal data, where neighbouring observations carry meaningful information about the missing values.
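As a minimal illustration of what linear interpolation with limit_direction='both' does, consider a toy series with both a leading gap and interior gaps (values are illustrative):

```python
import pandas as pd

# Toy series: one leading NaN and two interior NaNs
s = pd.Series([None, 2.0, None, None, 5.0])

# Interior NaNs are interpolated linearly between 2.0 and 5.0 (3.0, 4.0);
# the leading NaN, which has no earlier anchor, is back-filled to 2.0
filled = s.interpolate(method='linear', limit_direction='both')
```

The same behavior applies column by column in the per-country interpolation below.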
# Handling missing values using linear interpolation for temporal data
# Define a function for interpolation
def interpolate_group(group):
    # Apply linear interpolation to numerical columns within each group
    float_cols = group.select_dtypes(include=['float64']).columns
    group.loc[:, float_cols] = group[float_cols].interpolate(method='linear', limit_direction='both')
    return group
# Apply interpolation separately for each country
interpolated_df = europe_data.copy()
interpolated_df = interpolated_df.groupby('location').apply(interpolate_group)
# Reset index to remove the 'location' index level
interpolated_df.reset_index(drop=True, inplace=True)
interpolated_df
| iso_code | location | date | total_cases_per_million | total_deaths_per_million | stringency_index | reproduction_rate | population_density | median_age | aged_65_older | aged_70_older | gdp_per_capita | extreme_poverty | female_smokers | male_smokers | hospital_beds_per_thousand | life_expectancy | human_development_index | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALB | Albania | 2020-01-05 | 11.610 | 0.352 | 0.00 | 1.15 | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 1 | ALB | Albania | 2020-01-06 | 11.610 | 0.352 | 0.00 | 1.15 | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 2 | ALB | Albania | 2020-01-07 | 11.610 | 0.352 | 0.00 | 1.15 | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 3 | ALB | Albania | 2020-01-08 | 11.610 | 0.352 | 0.00 | 1.15 | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 4 | ALB | Albania | 2020-01-09 | 11.610 | 0.352 | 0.00 | 1.15 | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 1.1 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 60882 | GBR | United Kingdom | 2022-10-22 | 368948.549 | 3438.241 | 5.56 | 0.30 | 272.898 | 40.8 | 18.517 | 12.527 | 39753.244 | 0.2 | 20.0 | 24.7 | 2.54 | 81.32 | 0.932 |
| 60883 | GBR | United Kingdom | 2022-10-23 | 368948.549 | 3438.241 | 5.56 | 0.30 | 272.898 | 40.8 | 18.517 | 12.527 | 39753.244 | 0.2 | 20.0 | 24.7 | 2.54 | 81.32 | 0.932 |
| 60884 | GBR | United Kingdom | 2022-10-24 | 368948.549 | 3438.241 | 5.56 | 0.30 | 272.898 | 40.8 | 18.517 | 12.527 | 39753.244 | 0.2 | 20.0 | 24.7 | 2.54 | 81.32 | 0.932 |
| 60885 | GBR | United Kingdom | 2022-10-25 | 368948.549 | 3438.241 | 5.56 | 0.30 | 272.898 | 40.8 | 18.517 | 12.527 | 39753.244 | 0.2 | 20.0 | 24.7 | 2.54 | 81.32 | 0.932 |
| 60886 | GBR | United Kingdom | 2022-10-26 | 368969.317 | 3438.241 | 5.56 | 0.30 | 272.898 | 40.8 | 18.517 | 12.527 | 39753.244 | 0.2 | 20.0 | 24.7 | 2.54 | 81.32 | 0.932 |
60887 rows × 18 columns
# Displaying the number of missing values after interpolation
print('Number of missing values:')
for col in interpolated_df.columns:
print('\t%s: %d' % (col,interpolated_df[col].isna().sum()))
Number of missing values:
	iso_code: 0
	location: 0
	date: 0
	total_cases_per_million: 0
	total_deaths_per_million: 0
	stringency_index: 3040
	reproduction_rate: 0
	population_density: 0
	median_age: 0
	aged_65_older: 0
	aged_70_older: 1520
	gdp_per_capita: 0
	extreme_poverty: 16742
	female_smokers: 1520
	male_smokers: 1520
	hospital_beds_per_thousand: 0
	life_expectancy: 0
	human_development_index: 0
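Some columns still contain missing values after interpolation because linear interpolation can only fill gaps within a country's own series: a variable that a country never reports at all stays entirely NaN. A minimal self-contained sketch (toy data, hypothetical column names) illustrates both behaviours:

```python
import numpy as np
import pandas as pd

# Toy frame: country 'B' never reports 'metric_b' (hypothetical columns)
df = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B', 'B'],
    'metric_a': [1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
    'metric_b': [5.0, np.nan, 7.0, np.nan, np.nan, np.nan],
})

def interp(group):
    # Same per-group linear interpolation as in the main pipeline
    cols = group.select_dtypes(include=['float64']).columns
    group.loc[:, cols] = group[cols].interpolate(method='linear', limit_direction='both')
    return group

out = df.groupby('location', group_keys=False).apply(interp)

print(out['metric_a'].tolist())      # gaps bracketed by data are filled
print(out['metric_b'].isna().sum())  # a group with no observations stays NaN
```

This is why columns such as stringency_index retain missing values: some countries have no reported values for them at any date.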
The "extreme_poverty" variable was removed from the dataset because a substantial number of missing values remained even after interpolation. Moreover, much of the information related to extreme poverty is already captured by the Human Development Index (HDI), which provides a broader composite measure. The remaining rows with missing values were then dropped using the dropna() method.
In conclusion, the final tidy dataset contains roughly 56,000 observations, covering a time window from early 2020 to late 2022, with 14 numerical features related to the pandemic and its containment for 37 European countries.
interpolated_df = interpolated_df.drop(columns=['extreme_poverty'])
interpolated_df = interpolated_df.dropna().reset_index(drop=True)
interpolated_df
| iso_code | location | date | total_cases_per_million | total_deaths_per_million | stringency_index | reproduction_rate | population_density | median_age | aged_65_older | aged_70_older | gdp_per_capita | female_smokers | male_smokers | hospital_beds_per_thousand | life_expectancy | human_development_index | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALB | Albania | 2020-01-05 | 11.610 | 0.352 | 0.00 | 1.15 | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 1 | ALB | Albania | 2020-01-06 | 11.610 | 0.352 | 0.00 | 1.15 | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 2 | ALB | Albania | 2020-01-07 | 11.610 | 0.352 | 0.00 | 1.15 | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 3 | ALB | Albania | 2020-01-08 | 11.610 | 0.352 | 0.00 | 1.15 | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| 4 | ALB | Albania | 2020-01-09 | 11.610 | 0.352 | 0.00 | 1.15 | 104.871 | 38.0 | 13.188 | 8.643 | 11803.431 | 7.1 | 51.2 | 2.89 | 78.57 | 0.795 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56322 | GBR | United Kingdom | 2022-10-22 | 368948.549 | 3438.241 | 5.56 | 0.30 | 272.898 | 40.8 | 18.517 | 12.527 | 39753.244 | 20.0 | 24.7 | 2.54 | 81.32 | 0.932 |
| 56323 | GBR | United Kingdom | 2022-10-23 | 368948.549 | 3438.241 | 5.56 | 0.30 | 272.898 | 40.8 | 18.517 | 12.527 | 39753.244 | 20.0 | 24.7 | 2.54 | 81.32 | 0.932 |
| 56324 | GBR | United Kingdom | 2022-10-24 | 368948.549 | 3438.241 | 5.56 | 0.30 | 272.898 | 40.8 | 18.517 | 12.527 | 39753.244 | 20.0 | 24.7 | 2.54 | 81.32 | 0.932 |
| 56325 | GBR | United Kingdom | 2022-10-25 | 368948.549 | 3438.241 | 5.56 | 0.30 | 272.898 | 40.8 | 18.517 | 12.527 | 39753.244 | 20.0 | 24.7 | 2.54 | 81.32 | 0.932 |
| 56326 | GBR | United Kingdom | 2022-10-26 | 368969.317 | 3438.241 | 5.56 | 0.30 | 272.898 | 40.8 | 18.517 | 12.527 | 39753.244 | 20.0 | 24.7 | 2.54 | 81.32 | 0.932 |
56327 rows × 17 columns
print(interpolated_df['location'].unique())
print('Total number of countries:', interpolated_df['location'].nunique())
['Albania' 'Austria' 'Belarus' 'Belgium' 'Bosnia and Herzegovina' 'Bulgaria' 'Croatia' 'Cyprus' 'Czechia' 'Denmark' 'Estonia' 'Finland' 'France' 'Germany' 'Greece' 'Hungary' 'Iceland' 'Ireland' 'Italy' 'Latvia' 'Lithuania' 'Luxembourg' 'Malta' 'Moldova' 'Netherlands' 'Norway' 'Poland' 'Portugal' 'Romania' 'Russia' 'Slovakia' 'Slovenia' 'Spain' 'Sweden' 'Switzerland' 'Ukraine' 'United Kingdom']
Total number of countries: 37
In this context, data visualisation can be useful for understanding the spread of the pandemic, identifying trends and geographical variations, and assessing the effectiveness of control measures. To this end, various visualisation techniques will be applied in the following section to try to obtain an initial idea of the situation.
To begin with, the first graph is an interactive map visualising the monthly average of cumulative COVID-19 cases per million inhabitants in European countries over the period covered by the data. This type of visualisation is extremely useful in providing a clear and immediate overview. It facilitates the understanding of the spread of the pandemic in time and space, allowing observers to identify areas with high infection rates, monitor the evolution of the situation over the years and identify geographical or temporal patterns.
map_data = interpolated_df.loc[:, ['location', 'date', 'total_cases_per_million']]
map_data['year_month'] = map_data['date'].dt.to_period('M').astype(str)  # strings, as Period objects are not serializable for Plotly
map_data.sort_values(by='year_month', inplace=True)
monthly_average = map_data.groupby(['location', 'year_month'])['total_cases_per_million'].mean().reset_index().round()
# buckets to facilitate visualization
monthly_average['cases_bucket'] = pd.qcut(monthly_average['total_cases_per_million'], q=5, labels=False)
fig = px.choropleth(
monthly_average,
locations="location",
locationmode="country names",
color="cases_bucket",
animation_frame="year_month",
title="Monthly Incidence of Total Cases per Million Residents (2020-2023) - European Countries",
labels={'cases_bucket': 'Cases Bucket'},
color_continuous_scale="inferno_r"
)
fig.update_geos(projection_type="natural earth")
fig.update_geos(
lataxis_range=[30, 85],
lonaxis_range=[-25, 75],
showcountries=True,
countrycolor="white"
)
fig.update_layout(
geo=dict(showframe=False, showcoastlines=False),
height=600,
legend=dict(
font=dict(size=10)
)
)
fig.show()
To analyse the trend of the pandemic, time series graphs were constructed. In particular, three metrics were chosen; their monthly averages were calculated across all countries and then plotted.
The graph is then divided into three parts:
The first subplot (left) displays the total cases per million, which seem to fluctuate considerably over the time period considered.
The second subplot (centre) represents total deaths per million, which seem to follow a similar trend to total cases per million, but with lower values.
The third subplot (right) illustrates the trend in the index of the severity of restrictive measures taken by governments, which seems to be inversely correlated with total cases and deaths per million.
Overall, the time series graph suggests a possible relationship between the severity index and the spread of the disease. Periods with a higher severity index seem to coincide with fewer total cases and deaths per million, and vice versa.
monthly_avg = interpolated_df.drop(columns=['iso_code', 'location']).resample('M', on='date').mean().reset_index()
columns_to_plot = ['total_cases_per_million', 'stringency_index', 'total_deaths_per_million']
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(20, 6))
axes[0].plot(monthly_avg['date'], monthly_avg['total_cases_per_million'], label='Total Cases', linewidth=2)
axes[0].set_title('Total Cases per Million')
axes[0].set_xlabel('Date')
axes[0].set_ylabel('Total Cases per Million')
axes[0].grid(True)
axes[0].tick_params(axis='x', rotation=45)
axes[1].plot(monthly_avg['date'], monthly_avg['total_deaths_per_million'], color='green', label='Total Deaths', linewidth=2)
axes[1].set_title('Total Deaths per Million')
axes[1].set_xlabel('Date')
axes[1].set_ylabel('Total Deaths per Million')
axes[1].grid(True)
axes[1].tick_params(axis='x', rotation=45)
axes[2].plot(monthly_avg['date'], monthly_avg['stringency_index'], color='orange', label='Stringency Index', linewidth=2)
axes[2].set_title('Stringency Index')
axes[2].set_xlabel('Date')
axes[2].set_ylabel('Stringency Index')
axes[2].grid(True)
axes[2].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
The same principle applies to the plot below, whose purpose is to display the trend of the same measures across the months of the year. Here, too, it can be noted that as the stringency index decreases, the number of cases tends to increase. In addition, the restrictive measures show a decreasing trend towards the summer months, with the opposite trend for total cases and deaths. This may be because the summer months offer more opportunities for people to come into contact with each other, which can accelerate the spread of the virus.
monthly_data = interpolated_df.loc[:, ['date', 'total_cases_per_million', 'stringency_index', 'total_deaths_per_million']]
monthly_data['month'] = interpolated_df['date'].dt.month
monthly_data.sort_values(by='month', inplace=True)
monthly_average = monthly_data.drop(columns=['date']).groupby('month').mean().reset_index().round()
# Monthly average data
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
confirmed_cases = monthly_average['total_cases_per_million']
deaths = monthly_average['total_deaths_per_million']
stringency_index = monthly_average['stringency_index']
# Standardization of data using z-score normalization
standardized_cases = (confirmed_cases - np.mean(confirmed_cases)) / np.std(confirmed_cases)
standardized_deaths = (deaths - np.mean(deaths)) / np.std(deaths)
standardized_stringency_index = (stringency_index - np.mean(stringency_index)) / np.std(stringency_index)
plt.figure(figsize=(10, 6))
plt.plot(months, standardized_cases, marker='o', color='blue', label='Total Cases per Million')
plt.plot(months, standardized_deaths, marker='o', color='red', label='Total Deaths per Million')
plt.plot(months, standardized_stringency_index, marker='o', color='green', label='Stringency Index')
plt.xlabel('Month')
plt.ylabel('Standardized Value')
plt.title('Trend of COVID-19 Variables in Europe')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Concluding this section of data visualisation is a representation of the correlation matrix. The graph in question is a heatmap showing the correlations between the different numerical variables in the dataset, calculated through Pearson's correlation coefficient. This coefficient varies between -1 and 1, where a value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value close to 0 little or no linear relationship.
This plot supports the identification of significant patterns and relationships between variables, the understanding of their mutual influences and the analysis and decision-making based on the data with immediate visual information.
The variable with the most significant correlation with the total number of cases appears to be the stringency index. These two variables show a negative correlation of about -0.60, indicating that an increase in the stringency index is associated with a decrease in the total number of cases. This relationship will be examined in more detail in the next part of the analysis.
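As a minimal illustration of the coefficient itself (toy numbers, not the project data), pandas' `Series.corr` agrees with the textbook definition of Pearson's r, the covariance divided by the product of the standard deviations:

```python
import numpy as np
import pandas as pd

# Two toy series with an inverse relationship (illustrative only)
stringency = pd.Series([10.0, 30.0, 50.0, 70.0, 90.0])
cases = pd.Series([900.0, 700.0, 520.0, 280.0, 100.0])

# Pearson's r via pandas...
r_pandas = stringency.corr(cases)

# ...and directly from the definition
r_manual = ((stringency - stringency.mean()) * (cases - cases.mean())).sum() / (
    np.sqrt(((stringency - stringency.mean()) ** 2).sum())
    * np.sqrt(((cases - cases.mean()) ** 2).sum())
)

print(round(r_pandas, 4), round(r_manual, 4))  # both close to -1
```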
correlation_matrix = interpolated_df.select_dtypes(include='number').corr()
plt.figure(figsize=(20, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
To further investigate the previously mentioned relationship between total cases and the severity of containment measures, a linear regression was performed using the stringency index as the independent variable and the total number of cases per million as the dependent variable. In order to evaluate the model's performance, the data was split into a training set (70%) and a test set (30%).
The linear regression yielded an $R^2$ score of approximately 13.97%, indicating that the model is able to explain only a small portion of the variability in the data. The visualization of the model clearly shows the general trend, but the low R² suggests that other factors might significantly influence the total COVID-19 cases. This was expected given the complexity of the phenomenon, whose explanation cannot be attributed to a single factor.
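For reference, the $R^2$ score used throughout this analysis is defined as

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$

where $\hat{y}_i$ are the model predictions and $\bar{y}$ is the mean of the observed values. A score of 13.97% therefore means the regression reduces the squared error by roughly 14% relative to simply predicting the mean.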
stringency_cases_average = interpolated_df.groupby(['stringency_index'])[['total_cases_per_million']].mean().reset_index()
stringency_cases_average
| stringency_index | total_cases_per_million | |
|---|---|---|
| 0 | 0.00 | 131569.107451 |
| 1 | 2.78 | 0.809709 |
| 2 | 5.56 | 333372.352213 |
| 3 | 8.33 | 119784.826357 |
| 4 | 9.26 | 161962.701781 |
| ... | ... | ... |
| 2414 | 90.74 | 2288.830881 |
| 2415 | 92.59 | 539.770630 |
| 2416 | 93.52 | 2982.817955 |
| 2417 | 94.44 | 850.006000 |
| 2418 | 96.30 | 278.564171 |
2419 rows × 2 columns
# Feature/regressor (X) and target variable (y)
X = stringency_cases_average.loc[:, ['stringency_index']].values
y = stringency_cases_average['total_cases_per_million'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# training = 70% of points
# testing = 30% of points
X
array([[ 0. ],
[ 2.78],
[ 5.56],
...,
[93.52],
[94.44],
[96.3 ]])
y
array([1.31569107e+05, 8.09709302e-01, 3.33372352e+05, ...,
2.98281795e+03, 8.50006000e+02, 2.78564171e+02])
# Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
# Predictions
y_pred = lin_reg.predict(X_test)
# R-squared
R2_score = r2_score(y_test, y_pred)
print("Simple Linear Regression R2 Score: {:.2f}%".format(R2_score * 100))
Simple Linear Regression R2 Score: 13.97%
# Visualizing Linear Regression result
plt.figure(figsize=(14, 7))
plt.scatter(X, y, color='black', s=5, label='Observed Data')
plt.plot(X_test, y_pred, color='blue', linewidth=2, label='Linear Regression Model')
plt.legend()
plt.xlabel('Government Response Stringency Index')
plt.ylabel('Average confirmed cases of COVID-19 per 1,000,000 people')
plt.title('Linear Regression')
plt.show()
To improve the model, polynomial regression with degrees from 1 to 10 was applied, which offers greater flexibility in capturing the non-linearities present in the data. Using cross-validation with GridSearchCV, the optimal polynomial degree was determined to be 8. This model produced an $R^2$ score of 15.44%, slightly better than simple linear regression, suggesting a greater ability to fit the observed data. The polynomial regression plot shows how this model captures fluctuations in the data more accurately than the linear model.
The process of determining the optimal polynomial degree underscored the importance of striking a balance between model complexity and the risk of overfitting. Leveraging GridSearchCV ensured a robust evaluation of the best model, considering various possible configurations.
Moreover, a visual examination of Mean Squared Error (MSE) and $R^2$ scores supported the selection of the optimal degree, affirming that a degree of 8 yields the lowest MSE and the highest $R^2$ value.
# make_pipeline and PolynomialFeatures are required here
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(**kwargs))
X_range = np.linspace(X.min(), X.max(), 500).reshape(-1, 1)
plt.figure(figsize=(14, 7))
degrees = np.arange(1, 11)
for degree in degrees:
y_pred = PolynomialRegression(degree).fit(X_train, y_train).predict(X_range)
plt.plot(X_range.ravel(), y_pred, label='degree={0}'.format(degree))
plt.scatter(X, y, color='black', s=5)
plt.xlabel('Government Response Stringency Index')
plt.ylabel('Average confirmed cases of COVID-19 per 1,000,000 people')
plt.title('Polynomial Regression')
plt.legend(loc='best')
plt.show()
# Compute MSE and R-squared
MSE = []
R2_score = []
for degree in degrees:
y_pred = PolynomialRegression(degree).fit(X_train, y_train).predict(X_test)
MSE.append(mean_squared_error(y_test, y_pred))
R2_score.append(r2_score(y_test, y_pred))
plt.figure(figsize=(14, 7))
plt.subplot(2, 1, 1)
plt.plot(degrees, MSE, color='blue', marker='o', label='MSE')
plt.axvline(x=(MSE.index(min(MSE)) + 1), color='red', linestyle='--', label='optimal degree value')
plt.legend(loc='best')
plt.subplot(2, 1, 2)
plt.plot(degrees, R2_score, color='orange', marker='o', label='R2 Score')
plt.xlabel('Degree')
plt.axvline(x=(R2_score.index(max(R2_score)) + 1), color='red', linestyle='--', label='optimal degree value')
plt.legend(loc='best')
plt.show()
# Determine the optimal degree value
if MSE.index(min(MSE)) == R2_score.index(max(R2_score)):
    optimal_degree = MSE.index(min(MSE)) + 1
    print("The optimal degree value is", optimal_degree, "with an R2 of {:.2f}%".format(R2_score[optimal_degree - 1] * 100))
else:
    print("MSE and R2 disagree on the optimal degree; inspect the curves above")
The optimal degree value is 8 with an R2 of 15.44%
# Find optimal degree value with GridSearchCV
plt.figure(figsize=(14, 7))
param_grid = {'polynomialfeatures__degree': np.arange(1, 11),
'linearregression__fit_intercept': [True, False]}
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
model = grid.best_estimator_
plt.scatter(X, y, color='black', s=5, label='Observed Data')
# Use a separate name for the fitted curve so y_test is not overwritten
y_curve = model.fit(X_train, y_train).predict(X_range)
plt.plot(X_range.ravel(), y_curve, color='blue', linewidth=2,
         label=f'Polynomial Regression degree {grid.best_params_["polynomialfeatures__degree"]}')
plt.xlabel('Government Response Stringency Index')
plt.ylabel('Average confirmed cases of COVID-19 per 1,000,000 people')
plt.title('Polynomial Regression - COVID-19 Cases vs Stringency Index')
plt.legend(loc='best')
plt.show()
{'linearregression__fit_intercept': False, 'polynomialfeatures__degree': 8}
The analysis using linear and polynomial regression revealed a negative relationship between the stringency index and the total cases of COVID-19 per million inhabitants. Although the models showed a limited ability to explain the variability in the data, polynomial regression provided a slightly more accurate representation than simple linear regression. These results indicate that stricter containment measures are generally associated with fewer confirmed cases, supporting the importance of government pandemic response policies.
Feature ranking analysis is critical in the data modeling process, as it helps identify the most influential variables in predicting the phenomenon of interest. In this context, three distinct approaches have been adopted:
Univariate Selection: This approach evaluates the relationship between each feature and the target variable independently of the other features, using statistical measures such as the F-score. It indicates the strength of the association between each feature and the output variable.
Feature Importance with Random Forest: This method uses a Random Forest model to estimate the relative importance of each feature, based on its contribution to reducing the prediction error across the trees. This makes it possible to identify the most influential features.
Recursive Feature Elimination (RFE) with Lasso: This technique wraps a Lasso model (L1 regularization) and iteratively eliminates the least relevant features, producing a ranking of all features.
X = interpolated_df.drop(columns=['iso_code', 'location', 'date', 'total_cases_per_million'])
y = interpolated_df['total_cases_per_million']
# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Method 1: Univariate Selection
from sklearn.feature_selection import SelectKBest, f_regression
select_k_best = SelectKBest(score_func=f_regression, k='all')
select_k_best.fit(X_scaled, y)
scores = select_k_best.scores_
# Method 2: Feature Importance with Random Forest
from sklearn.ensemble import RandomForestRegressor
model_rf = RandomForestRegressor()
model_rf.fit(X_scaled, y)
importances = model_rf.feature_importances_
# Method 3: Recursive Feature Elimination (RFE) with Lasso
from sklearn.feature_selection import RFE
model_lasso = LassoCV()
rfe = RFE(model_lasso, n_features_to_select=1)
rfe.fit(X_scaled, y)
ranking = rfe.ranking_
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
sorted_indices = np.argsort(scores)
axes[0].barh(np.array(X.columns)[sorted_indices], np.array(scores)[sorted_indices])
axes[0].set_xlabel('Scores')
axes[0].set_ylabel('Features')
axes[0].set_title('Feature Importance (Univariate Selection)')
sorted_indices = np.argsort(importances)
axes[1].barh(np.array(X.columns)[sorted_indices], np.array(importances)[sorted_indices])
axes[1].set_xlabel('Importance')
axes[1].set_ylabel('Features')
axes[1].set_title('Feature Importance (Random Forest)')
sorted_indices = np.argsort(ranking)[::-1]
axes[2].barh(np.array(X.columns)[sorted_indices], np.array(ranking)[sorted_indices])
axes[2].set_xlabel('Ranking')
axes[2].set_ylabel('Features')
axes[2].set_title('Feature Ranking (RFE with Lasso)')
plt.tight_layout()
plt.show()
It is worth noting that the stringency index consistently emerges as the foremost feature in all three rankings. This suggests that the extent of containment measures plays a pivotal role in determining the spread of COVID-19 cases.
In addition to the stringency index, other notable features include the reproduction rate, human development index, and life expectancy. These variables represent crucial determinants of a population's susceptibility to the virus and its capacity to effectively respond to the pandemic.
Interestingly, population density is highlighted as one of the most influential features solely in the ranking derived from the random forest method. This observation underscores the interplay between population density and various factors such as transmission rates and the challenges associated with implementing social distancing measures.
To comprehend the dynamics of total COVID-19 cases and identify the most influential variables in the phenomenon, Ridge and Lasso Regression models were applied. These models provide further insights into the relationships between independent features and the target variable, representing the relative weight of each feature in the prediction process.
Before model training, the data was split into training (70%) and test (30%) sets, and the features were standardized to ensure consistency. Then, using GridSearchCV, the optimal value of the regularization parameter alpha was sought; this parameter helps prevent overfitting by penalizing excessively large coefficients.
The analysis of coefficients in Ridge and Lasso models enabled the extraction of valuable insights into the most influential variables in total COVID-19 cases. It was revealed that features such as mortality per million inhabitants or HDI have a significant and positive impact, while the stringency index exhibits a strong negative impact, indicating that stricter control measures are associated with a reduction in confirmed cases.
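To make the role of the penalty concrete before turning to the real data, a small self-contained sketch (synthetic data, not the project dataset) contrasts the two penalties: Ridge shrinks all coefficients smoothly, while Lasso's L1 penalty drives the coefficients of irrelevant features exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only the first two features carry signal; the rest are pure noise
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=10.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)

# Ridge keeps every coefficient non-zero (merely shrunk);
# Lasso zeroes out the noise features entirely
print(np.round(ridge.coef_, 3))
print(np.round(lasso.coef_, 3))
```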
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Parameter grid
alpha_values = np.logspace(-4, 6, 50)
# Ridge Regression with GridSearchCV
ridge = Ridge()
ridge_grid = GridSearchCV(ridge, param_grid={'alpha': alpha_values}, cv=5)
ridge_grid.fit(X_train_scaled, y_train)
# Lasso Regression with GridSearchCV
lasso = Lasso(max_iter=10000)
lasso_grid = GridSearchCV(lasso, param_grid={'alpha': alpha_values}, cv=5)
lasso_grid.fit(X_train_scaled, y_train)
best_ridge = ridge_grid.best_estimator_
best_lasso = lasso_grid.best_estimator_
print(f"Best alpha for Ridge: {ridge_grid.best_params_['alpha']}")
print(f"Best alpha for Lasso: {lasso_grid.best_params_['alpha']}")
# Plot the coefficients for different alphas
coefs_ridge = []
coefs_lasso = []
for a in alpha_values:
ridge.set_params(alpha=a)
lasso.set_params(alpha=a)
ridge.fit(X_train_scaled, y_train)
lasso.fit(X_train_scaled, y_train)
coefs_ridge.append(ridge.coef_)
coefs_lasso.append(lasso.coef_)
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plt.plot(alpha_values, coefs_ridge)
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('Coefficients')
plt.title('Ridge Regression')
plt.axvline(ridge_grid.best_params_['alpha'], linestyle='--', color='r', label='Best alpha')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(alpha_values, coefs_lasso)
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('Coefficients')
plt.title('Lasso Regression')
plt.axvline(lasso_grid.best_params_['alpha'], linestyle='--', color='r', label='Best alpha')
plt.legend()
plt.show()
Best alpha for Ridge: 3.088843596477485
Best alpha for Lasso: 1.9306977288832496
feature_names = X_train.columns.tolist()
# Best model coefficients
coef_dict = {
'Feature': feature_names,
'Ridge Coefficients': best_ridge.coef_,
'Lasso Coefficients': best_lasso.coef_
}
coefs_df = pd.DataFrame(coef_dict)
coefs_df.set_index('Feature', inplace=True)
coefs_df
| Ridge Coefficients | Lasso Coefficients | |
|---|---|---|
| Feature | ||
| total_deaths_per_million | 77039.543800 | 77043.488677 |
| stringency_index | -82967.874985 | -82971.339260 |
| reproduction_rate | -36556.283873 | -36557.108575 |
| population_density | 1243.494697 | 1254.589398 |
| median_age | 19442.415838 | 19446.235056 |
| aged_65_older | -66155.980061 | -66259.498030 |
| aged_70_older | 55028.543227 | 55112.753552 |
| gdp_per_capita | 14453.839349 | 14424.790910 |
| female_smokers | -25161.542357 | -25152.576091 |
| male_smokers | 50873.554625 | 50863.261094 |
| hospital_beds_per_thousand | -9000.336384 | -9003.559348 |
| life_expectancy | 33030.387866 | 33020.512221 |
| human_development_index | 75016.368738 | 75049.223679 |
As mentioned above, the analysis of coefficients in the Ridge and Lasso Regression models provides a further understanding of the relationships between the independent variables and the target variable. However, these models do not explicitly take into account the interactions between variables. This is where the Random Forest Regression comes in, offering another approach to feature selection and data prediction.
Random Forest regression performs an implicit form of feature selection, since only a random subset of features is considered at each split of each tree. This process makes it possible to identify the most influential features for predicting total COVID-19 cases.
numBaseClassifiers = 500
maxdepth = 20
rf = RandomForestRegressor(n_estimators=numBaseClassifiers, max_depth=maxdepth, random_state=0)
rf.fit(X_train_scaled, y_train)
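The implicit ranking can be read off a fitted model via its `feature_importances_` attribute. A minimal self-contained sketch (synthetic data with hypothetical feature names, not the project dataset) shows the idea:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
# Only 'signal' drives the target; the 'noise_*' columns do not
X_demo = pd.DataFrame({
    'signal': rng.normal(size=n),
    'noise_1': rng.normal(size=n),
    'noise_2': rng.normal(size=n),
})
y_demo = 3.0 * X_demo['signal'] + rng.normal(scale=0.1, size=n)

rf_demo = RandomForestRegressor(n_estimators=100, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Importances are normalized to sum to 1; the informative feature dominates
imp = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
print(imp.sort_values(ascending=False))
```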
Building upon the analysis of Ridge and Lasso models, predictions are now generated using these models along with a Random Forest model on both training and test data. The results are then evaluated in terms of the coefficient of determination ($R^2$), mean squared error (MSE), and mean absolute error (MAE) for each model. These values provide a measure of how well the models fit the data and their ability to generalize.
Upon reviewing the results, it is observed that the Random Forest model exhibits the highest coefficient of determination on both training and test data, indicating greater adaptability. However, significant disparities between performance on training and test data suggest potential overfitting of the Random Forest model. This hypothesis is supported by the notably lower MSE and MAE values for Random Forest compared to Ridge and Lasso models, particularly on the training data.
These findings suggest that while Random Forest may have superior fitting capability to the training data, it may not generalize well to new data. In contrast, regularized Ridge and Lasso models appear to be more robust and generalizable. This conclusion is supported by the fact that despite slightly lower ($R^2$) coefficients, Ridge and Lasso models exhibit greater consistency between results on training and test data, indicating enhanced generalization ability.
# Predictions
ridge_train_pred = best_ridge.predict(X_train_scaled)
ridge_test_pred = best_ridge.predict(X_test_scaled)
lasso_train_pred = best_lasso.predict(X_train_scaled)
lasso_test_pred = best_lasso.predict(X_test_scaled)
rf_train_pred = rf.predict(X_train_scaled)
rf_test_pred = rf.predict(X_test_scaled)
# R-squared
ridge_train_r2 = r2_score(y_train, ridge_train_pred)
ridge_test_r2 = r2_score(y_test, ridge_test_pred)
lasso_train_r2 = r2_score(y_train, lasso_train_pred)
lasso_test_r2 = r2_score(y_test, lasso_test_pred)
rf_train_r2 = r2_score(y_train, rf_train_pred)
rf_test_r2 = r2_score(y_test, rf_test_pred)
# MSE and MAE
ridge_train_mse = mean_squared_error(y_train, ridge_train_pred)
ridge_test_mse = mean_squared_error(y_test, ridge_test_pred)
lasso_train_mse = mean_squared_error(y_train, lasso_train_pred)
lasso_test_mse = mean_squared_error(y_test, lasso_test_pred)
rf_train_mse = mean_squared_error(y_train, rf_train_pred)
rf_test_mse = mean_squared_error(y_test, rf_test_pred)
ridge_train_mae = mean_absolute_error(y_train, ridge_train_pred)
ridge_test_mae = mean_absolute_error(y_test, ridge_test_pred)
lasso_train_mae = mean_absolute_error(y_train, lasso_train_pred)
lasso_test_mae = mean_absolute_error(y_test, lasso_test_pred)
rf_train_mae = mean_absolute_error(y_train, rf_train_pred)
rf_test_mae = mean_absolute_error(y_test, rf_test_pred)
metrics_data = {
'Model': ['Ridge Regression (Train)', 'Ridge Regression (Test)',
'Lasso Regression (Train)', 'Lasso Regression (Test)',
'Random Forest (Train)', 'Random Forest (Test)'],
'R-squared': [ridge_train_r2, ridge_test_r2, lasso_train_r2, lasso_test_r2, rf_train_r2, rf_test_r2],
'MSE': [ridge_train_mse, ridge_test_mse, lasso_train_mse, lasso_test_mse, rf_train_mse, rf_test_mse],
'MAE': [ridge_train_mae, ridge_test_mae, lasso_train_mae, lasso_test_mae, rf_train_mae, rf_test_mae]
}
metrics_df = pd.DataFrame(metrics_data)
metrics_df
| Model | R-squared | MSE | MAE | |
|---|---|---|---|---|
| 0 | Ridge Regression (Train) | 0.618561 | 1.484414e+10 | 94991.508107 |
| 1 | Ridge Regression (Test) | 0.611601 | 1.481191e+10 | 94793.336223 |
| 2 | Lasso Regression (Train) | 0.618561 | 1.484414e+10 | 94992.071059 |
| 3 | Lasso Regression (Test) | 0.611602 | 1.481187e+10 | 94793.972237 |
| 4 | Random Forest (Train) | 0.999981 | 7.462927e+05 | 225.051385 |
| 5 | Random Forest (Test) | 0.999947 | 2.012459e+06 | 367.423008 |
The analysis proceeds with Principal Component Analysis (PCA), a technique aimed at reducing the dimensionality of the original dataset by generating new variables known as principal components. The main objective of this phase is to identify which and how many principal components can capture the majority of the variance present in the original data, thus facilitating analysis and improving the effectiveness of predictive models.
The variance explained by each principal component is displayed graphically, indicating each component's contribution to the overall variation in the data. As shown, PC1 explains approximately 32.46% of the total variance, followed by PC2 with around 22.74%, and so forth. The bar chart clearly illustrates how the explained variance decreases for the later components.
The second graph depicts the cumulative proportion of explained variance relative to the number of principal components. It is observed that the first two components explain approximately 55% of the variation, while the first three explain about 67%, and so on. This graph aids in determining the number of principal components to select in order to retain a significant amount of information, balancing dimensionality reduction with preservation of variation in the data.
As illustrated, the first eight principal components account for the majority of the observed variance (roughly 96% cumulatively). Given that components beyond the eighth add little additional information, only these first eight were included in the model construction. Subsequently, a linear regression model was developed, using the total number of cases as the dependent variable and the eight principal components as regressors. The model yields an $R^2$ of around 55%, indicating that it explains slightly more than half of the variance in the response.
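The cutoff used here (keeping the first eight components) can also be automated with a cumulative-variance threshold. Below is a minimal, self-contained sketch on synthetic data; the 90% threshold and the random matrix are illustrative assumptions, not the values used in this analysis:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix (the real analysis uses X_scaled)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:, 5:] = X[:, :5] @ rng.normal(size=(5, 5)) + 0.1 * X[:, 5:]  # induce correlation

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Smallest number of components whose cumulative explained variance reaches 90%
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumvar >= 0.90) + 1)
print(f"{n_components} components explain {cumvar[n_components - 1]:.1%} of the variance")
```

Since `cumvar` always ends at 1.0, `np.argmax` is guaranteed to find the first index that crosses the threshold.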
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
principal_components = pca.fit_transform(X_scaled)
principal_df = pd.DataFrame(data=principal_components, columns=[f'PC{i}' for i in range(1, 11)])
print("Principal Components:\n", principal_df)
print("\nVariance explained by each principal component (%):", pca.explained_variance_ratio_ * 100)
plt.figure(figsize=(15, 6))
plt.bar(range(1, 11), pca.explained_variance_ratio_ * 100, alpha=0.7)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance (%)')
plt.title('PCA: Variance Explained by each Principal Component')
plt.show()
# Explained variance ratios
explained_variance_ratio = pca.explained_variance_ratio_
explained_variance_ratio_percent = explained_variance_ratio * 100
# Cumulative explained variance ratios
cumulative_explained_variance_ratio = np.cumsum(explained_variance_ratio)
cumulative_explained_variance_ratio_percent = cumulative_explained_variance_ratio * 100
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
plt.plot(range(1, 11), explained_variance_ratio_percent, marker='o', color='blue')
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Variance Explained (%)')
plt.ylim(0, 100)
plt.subplot(1, 2, 2)
plt.plot(range(1, 11), cumulative_explained_variance_ratio_percent, marker='o', color='red')
plt.xlabel('Principal Component')
plt.ylabel('Cumulative Proportion of Variance Explained (%)')
plt.ylim(0, 100)
plt.tight_layout()
plt.show()
Principal Components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 \
0 3.448808 2.206704 0.123304 0.662258 0.904475 1.856561 -0.334957
1 3.448808 2.206704 0.123304 0.662258 0.904475 1.856561 -0.334957
2 3.448808 2.206704 0.123304 0.662258 0.904475 1.856561 -0.334957
3 3.448808 2.206704 0.123304 0.662258 0.904475 1.856561 -0.334957
4 3.448808 2.206704 0.123304 0.662258 0.904475 1.856561 -0.334957
... ... ... ... ... ... ... ...
56322 -1.064215 0.742045 -2.463546 0.337230 -0.798046 0.670403 -0.282067
56323 -1.064215 0.742045 -2.463546 0.337230 -0.798046 0.670403 -0.282067
56324 -1.064215 0.742045 -2.463546 0.337230 -0.798046 0.670403 -0.282067
56325 -1.064215 0.742045 -2.463546 0.337230 -0.798046 0.670403 -0.282067
56326 -1.064215 0.742045 -2.463546 0.337230 -0.798046 0.670403 -0.282067
PC8 PC9 PC10
0 1.683176 0.329584 0.015011
1 1.683176 0.329584 0.015011
2 1.683176 0.329584 0.015011
3 1.683176 0.329584 0.015011
4 1.683176 0.329584 0.015011
... ... ... ...
56322 0.006968 -0.229590 -0.129170
56323 0.006968 -0.229590 -0.129170
56324 0.006968 -0.229590 -0.129170
56325 0.006968 -0.229590 -0.129170
56326 0.006968 -0.229590 -0.129170
[56327 rows x 10 columns]
Variance explained by each principal component (%): [32.45714308 22.74048131 11.48806913 7.47604858 6.91333489 6.3066427
5.42581494 2.98735384 1.43279099 1.15009652]
X_train, X_test, y_train, y_test = train_test_split(principal_df.iloc[:, :8], y, test_size=0.3, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
R2_score = r2_score(y_test, y_pred)
print("R-squared (Simple Linear Regression using the first 8 principal components): {:.2f}%".format(R2_score * 100))
R-squared (Simple Linear Regression using the first 8 principal components): 55.68%
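The trade-off between the number of retained components and model fit can be illustrated on synthetic data: because a regression on the first $k$ principal components is nested inside the regression on the first $k+1$, the training $R^2$ never decreases as $k$ grows. This is a hypothetical sketch, not part of the original pipeline:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression problem: y depends linearly on all 10 features plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
w = np.arange(1, 11) / 5.0  # fixed illustrative coefficients
y = X @ w + rng.normal(scale=0.5, size=1000)

pcs = PCA(n_components=10).fit_transform(X)

# Training R^2 as a function of the number of retained components
r2_by_k = []
for k in range(1, 11):
    model = LinearRegression().fit(pcs[:, :k], y)
    r2_by_k.append(r2_score(y, model.predict(pcs[:, :k])))
print([round(r, 3) for r in r2_by_k])
```

On held-out data the curve typically flattens or even dips once low-variance components start fitting noise, which is why component selection matters.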
While the total cases variable was excluded from the initial Principal Component Analysis (PCA) because it served as the response variable of the model being developed, all variables are now incorporated into the analysis for a more comprehensive overview. Specifically, data from 2021 were selected due to their complete annual coverage, representing the pandemic's second year, a period marked by rapid changes. Consequently, these data offer a representative snapshot of the context. The goal remains to use PCA to distill the data into a two-dimensional representation, identifying the variables that distinguish between countries.
In doing so, the dataset is first prepared by excluding unnecessary instances and then standardized to ensure that all variables are on the same scale, a prerequisite for PCA. After standardization, PCA is performed with the aim of reducing the dimensionality of the dataset while preserving its variability. Here, only two principal components are retained for visualization purposes. These principal components represent new axes in the feature space, capturing the directions of maximum variance in the data.
The resulting principal components are then used to create a biplot, a visualization technique that combines a scatter plot of the data points with arrows indicating the direction and magnitude of each variable's contribution to the principal components. Each data point represents a country, positioned based on its scores on the two principal components. The arrows indicate the direction in which each variable has the highest variability, helping to interpret the relationships between variables and countries in the reduced dimensional space.
This biplot provides a comprehensive overview of the dataset, highlighting the variables that contribute most significantly to the observed differences among countries. By examining the positions of countries relative to each other and the directions of the variable arrows, it is possible to gain insights into the underlying factors driving variations in COVID-19 dynamics across different regions.
data_2021 = interpolated_df[interpolated_df['date'].dt.year == 2021].drop(columns=['iso_code', 'date'])
data_2021 = data_2021.groupby('location').mean().reset_index()
data_2021.set_index('location', inplace=True)
# Standardization
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data_2021)
data_2021_standardized = pd.DataFrame(scaled_features, index=data_2021.index, columns=data_2021.columns)
data_2021_standardized
| location | total_cases_per_million | total_deaths_per_million | stringency_index | reproduction_rate | population_density | median_age | aged_65_older | aged_70_older | gdp_per_capita | female_smokers | male_smokers | hospital_beds_per_thousand | life_expectancy | human_development_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Albania | -1.089345 | -1.063317 | 0.957055 | 0.073826 | -0.212402 | -1.651055 | -1.722618 | -1.516682 | -1.335070 | -2.309662 | 1.504746 | -1.071300 | -0.305650 | -1.640463 |
| Austria | -0.862988 | -0.512343 | 1.796914 | 0.057422 | -0.204547 | 0.832870 | 0.462239 | 0.836749 | 0.609927 | 0.925178 | -0.389680 | 1.110229 | 0.565179 | 0.664721 |
| Belarus | -1.119568 | -1.547532 | 0.032881 | -0.127895 | -0.455049 | -0.758394 | -1.137350 | -0.988831 | -1.024841 | -1.793303 | 1.028806 | 2.877852 | -1.413978 | -1.132233 |
| Belgium | -0.684730 | 0.192933 | 1.055259 | 0.778182 | 0.919806 | -0.176224 | 0.233000 | 0.422306 | 0.449270 | 0.424006 | -0.343019 | 0.267808 | 0.591568 | 0.828081 |
| Bosnia and Herzegovina | -0.974601 | 0.808688 | 0.325276 | 0.438819 | -0.364545 | 0.095455 | -0.494317 | -0.563323 | -1.340248 | 1.198545 | 1.178121 | -0.774261 | -0.648704 | -1.912729 |
| Bulgaria | -0.962302 | 0.685426 | 0.944469 | 0.240944 | -0.378415 | 0.949304 | 1.043148 | 0.617311 | -0.944149 | 1.183358 | 0.870160 | 1.151133 | -1.337744 | -1.259291 |
| Croatia | -0.753169 | -0.066129 | 0.508752 | 0.130215 | -0.342670 | 0.677625 | 0.651879 | 0.516351 | -0.706673 | 1.821214 | 0.450213 | 0.219114 | -0.329107 | -0.624004 |
| Cyprus | -0.742789 | -1.457023 | 1.704083 | 0.341420 | -0.117097 | -1.922734 | -1.639787 | -1.553562 | -0.143104 | -0.411282 | 1.644728 | -0.822956 | 0.400982 | 0.029434 |
| Czechia | -0.214312 | 0.678127 | 1.428409 | 0.241713 | -0.077283 | 0.405945 | 0.398663 | -0.162710 | -0.132072 | 1.244106 | 0.300899 | 0.749887 | -0.068151 | 0.265398 |
| Denmark | -1.087682 | -1.482057 | 0.874312 | 0.232742 | -0.080026 | 0.017832 | 0.634804 | 0.180739 | 0.681973 | -0.456843 | -1.518870 | -1.261210 | 0.377526 | 0.991440 |
| Estonia | 0.324327 | -0.594691 | -0.080621 | 0.126371 | -0.521239 | 0.173078 | 0.553063 | 0.718271 | -0.312769 | 0.332884 | 0.394220 | -0.194793 | -0.255805 | 0.120189 |
| Finland | -0.789873 | -1.473322 | -0.299976 | 0.688984 | -0.575182 | 0.211889 | 1.198275 | 0.613623 | 0.329398 | -0.608713 | -1.164249 | -0.881390 | 0.673666 | 0.955138 |
| France | 0.305238 | 0.061921 | 0.177812 | 0.401141 | -0.138341 | -0.098602 | 0.649700 | 0.528337 | 0.214893 | 1.183358 | 0.048931 | 0.433371 | 0.893573 | 0.283549 |
| Germany | -0.425626 | -0.538403 | 0.355988 | 1.101139 | 0.340311 | 1.686719 | 1.280017 | 1.855109 | 0.597931 | 0.894804 | -0.184373 | 1.417007 | 0.503605 | 1.118497 |
| Greece | -0.223598 | 0.027182 | 1.723625 | 0.304254 | -0.301877 | 1.182172 | 0.896014 | 1.194489 | -0.596531 | 1.973085 | 1.579403 | -0.428528 | 0.770425 | 0.047585 |
| Hungary | -0.421331 | 1.763023 | -0.520534 | 0.955039 | -0.199135 | 0.444757 | 0.235180 | 0.019848 | -0.469122 | 0.682186 | -0.025726 | 0.939797 | -0.801172 | -0.569551 |
| Iceland | -0.040377 | -1.741454 | -0.461346 | 0.492902 | -0.636801 | -1.922734 | -1.271042 | -1.256675 | 0.670433 | -1.216195 | -1.854827 | -1.061561 | 0.990331 | 1.154800 |
| Ireland | -0.110002 | -0.699949 | -0.190839 | 0.391401 | -0.358782 | -1.379376 | -1.453780 | -1.500546 | 1.876315 | 0.105078 | -0.874952 | -1.037213 | 0.788018 | 1.263706 |
| Italy | -0.268263 | 0.478000 | 0.958792 | 0.401141 | 0.209993 | 2.191267 | 1.849664 | 1.985574 | 0.019105 | -0.380908 | -0.678977 | -0.930085 | 1.142800 | 0.120189 |
| Latvia | 0.198300 | 0.409188 | 0.228637 | 0.014361 | -0.520490 | 0.638813 | 0.662778 | 1.015619 | -0.568226 | 0.499941 | 1.486082 | 0.233722 | -1.267374 | -0.351738 |
| Lithuania | 0.360302 | 0.472383 | -0.411640 | -0.229139 | -0.462256 | 0.483568 | 0.389580 | 0.850579 | -0.310282 | -0.153102 | 0.272902 | 0.715801 | -1.079720 | -0.061321 |
| Luxembourg | 0.115886 | -0.912504 | 0.407855 | 0.873275 | 0.317018 | -0.991262 | -1.314274 | -0.963937 | 3.434399 | -0.213850 | -0.846955 | -0.282443 | 0.773357 | 0.555815 |
| Malta | -0.682750 | -0.911775 | 0.602564 | 0.724099 | 5.430656 | 0.056644 | 0.543617 | -0.280727 | 0.093893 | -0.213850 | -0.455005 | -0.294617 | 0.855455 | 0.174643 |
| Moldova | -0.563953 | 0.780445 | 0.560153 | -0.333972 | -0.133836 | -1.806300 | -2.566916 | -2.294858 | -1.717524 | -2.491907 | 0.888824 | 0.345720 | -2.261350 | -2.457260 |
| Netherlands | 0.335986 | -0.774213 | 0.449904 | -0.139173 | 1.476011 | 0.367134 | 0.308565 | -0.023947 | 0.785490 | 0.317697 | -0.725638 | -0.861912 | 0.782153 | 1.064044 |
| Norway | 0.149735 | -1.430153 | -0.945025 | -0.272457 | -0.590549 | -0.991262 | -0.402767 | -0.516301 | 1.729704 | -0.411282 | -1.341559 | -0.725566 | 0.817338 | 1.300008 |
| Poland | -0.361788 | 0.772184 | -0.598087 | 0.101252 | -0.132280 | -0.176224 | -0.423838 | -0.797975 | -0.443742 | 0.150639 | -0.184373 | 0.745018 | -0.258737 | -0.097623 |
| Portugal | 1.532201 | 0.218246 | -0.721532 | 0.300153 | -0.181033 | 1.531474 | 1.297818 | 1.378891 | -0.402079 | -0.912454 | -0.473670 | -0.827826 | 0.714715 | -0.388040 |
| Romania | -0.368754 | 1.235655 | -0.229695 | -1.602479 | -0.294976 | 0.289512 | -0.028936 | -0.111999 | -0.669465 | 0.089891 | 0.188913 | 0.877468 | -1.044535 | -1.041478 |
| Russia | -0.572025 | 0.479116 | 0.229674 | -0.132508 | -0.614135 | -1.030074 | -1.362956 | -1.170928 | -0.585453 | 0.165826 | 2.167328 | 1.441354 | -2.061969 | -1.114082 |
| Slovakia | 1.298409 | 1.719596 | -1.458168 | -1.841365 | -0.177866 | -0.409092 | -1.038897 | -1.275115 | -0.273798 | 0.120265 | 0.244906 | 0.355459 | -0.607655 | -0.460644 |
| Slovenia | 3.429647 | 2.103149 | -1.458168 | -0.261948 | -0.221822 | 0.871681 | 0.411378 | 0.459647 | -0.201760 | -0.335346 | -0.940277 | -0.287313 | 0.500673 | 0.573966 |
| Spain | 0.936560 | 0.539929 | -1.415836 | 1.659395 | -0.261615 | 1.259795 | 0.547250 | 0.860260 | -0.035701 | 0.773308 | -0.343019 | -1.032344 | 1.157460 | 0.338002 |
| Sweden | 0.630015 | 0.124387 | -1.458168 | 1.440501 | -0.547652 | -0.486715 | 0.746699 | 0.691532 | 0.697400 | -0.532778 | -1.509538 | -1.397555 | 0.934622 | 1.082195 |
| Switzerland | 2.570077 | -0.338820 | -1.719780 | -1.883658 | 0.245060 | 0.328323 | 0.183955 | 0.327800 | 1.302349 | 0.044330 | -0.576323 | -0.272704 | 1.221966 | 1.263706 |
| Ukraine | -0.380084 | 0.784599 | -1.635182 | -3.170363 | -0.327345 | -0.331470 | -0.533190 | -0.368779 | -1.561129 | -1.337691 | 1.150124 | 1.806566 | -2.214437 | -1.930880 |
| United Kingdom | 1.513227 | 1.209506 | -1.717820 | -2.515732 | 0.490392 | -0.564338 | 0.213382 | 0.273862 | 0.281256 | -0.350533 | -0.968273 | -1.241732 | 0.500673 | 0.846232 |
pca = PCA(n_components=2) # only 2 principal components for visualization
principal_components = pca.fit_transform(data_2021_standardized)
plt.figure(figsize=(10, 8))
for i, country in enumerate(data_2021_standardized.index):
    # Invisible markers (alpha=0): each country is shown only as a text label
    plt.scatter(principal_components[i, 0], principal_components[i, 1], alpha=0)
    plt.text(principal_components[i, 0], principal_components[i, 1], country, color='black', fontsize=8, ha='left', va='bottom')
for i, var in enumerate(data_2021_standardized.columns):
plt.arrow(0, 0, pca.components_[0, i]*3, pca.components_[1, i]*3, color='red', alpha=0.2, head_width=0.02)
plt.text(pca.components_[0, i]*3.3, pca.components_[1, i]*3.3, var, color='red', ha='center', va='center', fontsize=8)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
Building upon the last point addressed in the PCA section, the analysis seeks to partition countries into groups sharing similar characteristics. This section explores clustering methods, employing both hierarchical clustering and KMeans clustering to segment the dataset into distinct groups.
Initially, hierarchical clustering is applied using various linkage methods (complete, single, and average). Dendrograms resulting from this process visualize the hierarchical relationships between countries, based on their feature similarities.
Subsequently, both hierarchical and KMeans clustering solutions are generated with the number of clusters set to 4, a choice informed by the dendrogram analysis (for hierarchical clustering) and by the elbow and silhouette methods (for KMeans). Mean values of the variables within each cluster are then calculated for both clustering methods. Looking at these values, it is possible to gain some insights into the variables that characterize each cluster.
A comparative visualization of the cluster assignments overlays countries' positions in the principal component space derived from PCA. This facilitates the assessment of how well the clustering algorithms group countries with similar characteristics. The final plot illustrates that while both methods share similarities, they also present differences in how countries are distributed across clusters. This indicates that the selection of a clustering method can impact data interpretation. Hierarchical clustering tends to segregate countries based on more pronounced disparities, while k-means clustering demonstrates a smoother distribution.
To gauge the effectiveness of these methods, the Adjusted Rand Index (ARI) is utilized, providing a quantitative measure of similarity between the two clustering solutions while considering chance associations.
The Adjusted Rand Index (ARI) ranges from -1 to 1, where:
- a value of 1 indicates that the two clusterings are identical (up to a relabeling of clusters);
- a value around 0 indicates agreement no better than random chance;
- negative values indicate less agreement than expected by chance.
The calculated ARI between the hierarchical clustering and KMeans clustering solutions is approximately 0.28. This value suggests only a weak-to-moderate agreement between the cluster assignments produced by the two methods.
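The interpretation of the ARI scale can be sanity-checked with toy labelings; this is a small illustrative example, separate from the COVID data:

```python
from sklearn.metrics import adjusted_rand_score

# ARI is invariant to label permutation: identical partitions score 1.0
print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0

# A partition whose agreement is exactly what chance predicts scores 0.0
print(adjusted_rand_score([0, 0, 1, 1], [0, 0, 0, 1]))  # 0.0
```

This adjustment for chance is what distinguishes the ARI from the raw Rand index, which would report a deceptively high value for the second pair.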
The disparity between the K-means and hierarchical clustering results can be attributed to their distinct methodologies. K-means partitions the data into k clusters starting from randomly initialized centroids, while hierarchical clustering builds a hierarchy of clusters without requiring the number of clusters up front. Additionally, K-means tends to favor spherical clusters and is more sensitive to outliers, whereas hierarchical clustering can accommodate clusters of arbitrary shape. Choosing an appropriate k is crucial for K-means and often requires multiple runs for stability, whereas hierarchical clustering offers a broader view of the cluster structure, albeit at a potentially higher computational cost.
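The difference in shape assumptions can be demonstrated on a toy dataset with non-spherical clusters. The sketch below uses scikit-learn's `make_moons` (not the COVID data) to show that single-linkage agglomerative clustering can follow elongated, curved clusters that K-means splits incorrectly:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: clearly non-spherical clusters
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
single_labels = AgglomerativeClustering(n_clusters=2, linkage='single').fit_predict(X)

ari_km = adjusted_rand_score(y_true, km_labels)
ari_single = adjusted_rand_score(y_true, single_labels)
print(f"KMeans ARI: {ari_km:.2f}, single-linkage ARI: {ari_single:.2f}")
```

On this geometry single linkage tracks the chain of nearby points along each moon, while K-means cuts the plane into two compact halves.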
# Hierarchical Clustering
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

distance_matrix = linkage(data_2021_standardized, method='complete')
distance_matrix_single = linkage(data_2021_standardized, method='single')
distance_matrix_avg = linkage(data_2021_standardized, method='average')
fig, axs = plt.subplots(1, 3, figsize=(18, 6))
axs[0].set_title('Complete Linkage')
dendrogram(distance_matrix, labels=data_2021_standardized.index, leaf_rotation=90, ax=axs[0])
axs[0].set_xlabel('Countries')
axs[0].set_ylabel('Distance')
axs[1].set_title('Single Linkage')
dendrogram(distance_matrix_single, labels=data_2021_standardized.index, leaf_rotation=90, ax=axs[1])
axs[1].set_xlabel('Countries')
axs[1].set_ylabel('Distance')
axs[2].set_title('Average Linkage')
dendrogram(distance_matrix_avg, labels=data_2021_standardized.index, leaf_rotation=90, ax=axs[2])
axs[2].set_xlabel('Countries')
axs[2].set_ylabel('Distance')
plt.tight_layout()
plt.show()
num_clusters = 4
distance_matrix = linkage(data_2021_standardized, method='complete')
hclust = fcluster(distance_matrix, num_clusters, criterion='maxclust')
data_2021_standardized['hCluster'] = hclust
# K-Means
from sklearn.cluster import KMeans

# Elbow Method
sse = []
k_range = range(1, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, n_init=20, random_state=0)
    # Fit on the original features only; the 'hCluster' labels must not leak into KMeans
    kmeans.fit(data_2021_standardized.drop(columns=['hCluster']))
    sse.append(kmeans.inertia_)
# Silhouette Method
silhouette_scores = []
for k in k_range[1:]:  # silhouette_score is not defined for k=1
    kmeans = KMeans(n_clusters=k, n_init=20, random_state=0)
    labels = kmeans.fit_predict(data_2021_standardized.drop(columns=['hCluster']))
    silhouette_scores.append(silhouette_score(data_2021_standardized.drop(columns=['hCluster']), labels))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6))
ax1.plot(k_range, sse, marker='o')
ax1.set_title('Elbow Method')
ax1.set_xlabel('Number of clusters')
ax1.set_ylabel('SSE')
ax2.plot(k_range[1:], silhouette_scores, marker='o')
ax2.set_title('Silhouette Method')
ax2.set_xlabel('Number of clusters')
ax2.set_ylabel('Silhouette Score')
plt.show()
num_clusters = 4
kmeans = KMeans(n_clusters=num_clusters, n_init=20, random_state=0)
# Again fit on the features only, excluding the hierarchical cluster labels
kmeans.fit(data_2021_standardized.drop(columns=['hCluster']))
kmcluster_labels = kmeans.labels_ + 1
data_2021_standardized['kmCluster'] = kmcluster_labels
hcluster_means = data_2021_standardized.groupby('hCluster').mean()
print("Average values for each variable within hierarchical clusters:")
hcluster_means
Average values for each variable within hierarchical clusters:
| hCluster | total_cases_per_million | total_deaths_per_million | stringency_index | reproduction_rate | population_density | median_age | aged_65_older | aged_70_older | gdp_per_capita | female_smokers | male_smokers | hospital_beds_per_thousand | life_expectancy | human_development_index | kmCluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.817536 | -0.561662 | 0.696769 | -0.035826 | -0.306504 | -1.433711 | -1.685926 | -1.504972 | -0.961199 | -1.368065 | 1.446886 | 0.554134 | -1.128393 | -1.262921 | 3.000000 |
| 2 | -0.236502 | 0.858461 | -0.284617 | -0.496272 | -0.319998 | 0.266225 | 0.046339 | -0.009748 | -0.728683 | 0.425525 | 0.563132 | 0.626982 | -0.958918 | -0.830926 | 1.900000 |
| 3 | 0.412953 | -0.004733 | -0.002993 | 0.172507 | 0.033586 | 0.572606 | 0.697911 | 0.714149 | 0.252240 | 0.266775 | -0.440733 | -0.331711 | 0.647450 | 0.598523 | 1.764706 |
| 4 | -0.113502 | -1.139167 | -0.117358 | 0.441844 | 0.832308 | -1.045598 | -0.779649 | -0.903637 | 1.560949 | -0.390020 | -1.074660 | -0.680280 | 0.844900 | 0.889794 | 2.000000 |
kmcluster_means = data_2021_standardized.groupby('kmCluster').mean()
print("Average values for each variable within k-means clusters:")
kmcluster_means
Average values for each variable within k-means clusters:
| kmCluster | total_cases_per_million | total_deaths_per_million | stringency_index | reproduction_rate | population_density | median_age | aged_65_older | aged_70_older | gdp_per_capita | female_smokers | male_smokers | hospital_beds_per_thousand | life_expectancy | human_development_index | hCluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.146811 | 0.317392 | 0.338570 | 0.413082 | -0.153860 | 0.711870 | 0.601873 | 0.663276 | -0.257086 | 0.696479 | 0.232829 | 0.223238 | 0.017224 | -0.086946 | 2.588235 |
| 2 | -0.164340 | -1.033449 | -0.113413 | 0.492475 | 0.492744 | -0.568650 | -0.112211 | -0.339582 | 1.144334 | -0.358971 | -1.143510 | -0.867052 | 0.776941 | 0.949088 | 3.555556 |
| 3 | -0.817536 | -0.561662 | 0.696769 | -0.035826 | -0.306504 | -1.433711 | -1.685926 | -1.504972 | -0.961199 | -1.368065 | 1.446886 | 0.554134 | -1.128393 | -1.262921 | 1.000000 |
| 4 | 1.343753 | 1.118948 | -1.369802 | -1.879257 | -0.047759 | 0.030769 | -0.132051 | -0.115764 | -0.187091 | -0.294848 | -0.150155 | 0.206291 | -0.273886 | -0.124850 | 2.500000 |
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
for cluster in range(1, num_clusters + 1):
cluster_data = principal_components[data_2021_standardized['hCluster'] == cluster]
plt.scatter(cluster_data[:, 0], cluster_data[:, 1], label=f'Cluster {cluster}')
for i, country in enumerate(data_2021_standardized.index):
plt.text(principal_components[i, 0] + 0.1, principal_components[i, 1], country,
color='black', fontsize=8, ha='left', va='center')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Hierarchical Clustering')
plt.legend()
plt.subplot(1, 2, 2)
for cluster in range(1, num_clusters + 1):
cluster_data = principal_components[data_2021_standardized['kmCluster'] == cluster]
plt.scatter(cluster_data[:, 0], cluster_data[:, 1], label=f'Cluster {cluster}')
for i, country in enumerate(data_2021_standardized.index):
plt.text(principal_components[i, 0] + 0.1, principal_components[i, 1], country,
color='black', fontsize=8, ha='left', va='center')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('KMeans Clustering')
plt.legend()
plt.tight_layout()
plt.show()
ari = adjusted_rand_score(data_2021_standardized['hCluster'], data_2021_standardized['kmCluster'])
print("Adjusted Rand Index:", ari)
Adjusted Rand Index: 0.2762193744149262
To sum up, this project explored the complex landscape of the COVID-19 pandemic across the European continent, employing a wide array of data exploration, cleaning, preprocessing, visualization, and modeling techniques.

Data exploration and visualization yielded valuable insights into the dynamics of the pandemic across Europe, with metrics such as total cases, deaths, and the stringency index revealing temporal trends and geographical variations. Notably, the analysis highlighted the significant impact of containment measures and lockdowns on COVID-19 transmission, showing a clear inverse relationship between the stringency index and the number of total cases. By combining social, economic, and public health data, the study also showed how the pandemic affected many aspects of society, underlining the importance of considering broader societal factors when designing effective response plans.

While the initial linear regression offered preliminary insights into the connection between containment policies and infections, its modest explanatory power indicated the need for more sophisticated modeling techniques. After experimenting with polynomial regression, the relevance of each feature was examined, and more complex models such as ridge, lasso, and random forest regression were built. Moreover, principal component analysis (PCA) and clustering methods provided deeper insights into the factors driving variations in COVID-19 dynamics across European nations, revealing patterns in how countries group together.

Ultimately, the project underscored the indispensable role of data-driven approaches in informing evidence-based strategies to mitigate the pandemic's societal impact.